1 Background

The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. At the time of writing, more than 112 million cases have been confirmed worldwide, with nearly 2.5 million associated deaths. Within the US alone, over 500,000 deaths and more than 28 million cases have been reported. Governments around the world have implemented or recommended a number of policies to slow the spread of the virus, including mask-wearing requirements, travel restrictions, business and school closures, and stay-at-home orders. The pandemic has affected individuals' lives in countless ways, and though many countries have begun vaccinating their populations, the long-term impact of the virus remains unclear.

The impact of COVID-19 on a given segment of the population appears to vary drastically with the socioeconomic characteristics of that segment. In particular, differing rates of infection and fatality have been reported among different racial, age, and socioeconomic groups. One of the most important metrics for gauging the impact of the pandemic is the death rate: the proportion of the total population that dies from the disease.

We assembled this dataset for our research, with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.

There are two main goals for this case study.

  1. We show how COVID cases and COVID-related deaths evolve over time at the state level.
  2. We try to determine which county-level demographic factors and policy interventions are associated with the mortality rate in the US, and we build models to identify possible factors related to county-level COVID-19 mortality rates.

Remark: please keep track of the most up-to-date version of this write-up.

2 Data Summary

The data comes from several different sources:

  1. County-level infection and fatality data - This dataset gives daily cumulative numbers on infection and fatality for each county.
  2. County-level socioeconomic data - The following are the four relevant datasets from this site.
    1. Income - Poverty level and household income.
    2. Jobs - Employment type, rate, and change.
    3. People - Population size, density, education level, race, age, household size, and migration rates.
    4. County Classifications - Type of county (rural or urban on a rural-urban continuum scale).
  3. Intervention Policy Data - This dataset is a manually compiled list of the dates that interventions/lockdown policies were implemented and lifted at the county level.

3 EDA

In this case study, we use the following three cleaned datasets:

  • covid_county.csv: County-level socioeconomic information that combines the four datasets mentioned above: Income (Poverty level and household income), Jobs (Employment type, rate, and change), People (Population size, density, education level, race, age, household size, and migration rates), and County Classifications
  • covid_rates.csv: Daily cumulative numbers on infection and fatality for each county
  • covid_intervention.csv: County-level lockdown intervention.

Across all datasets, each county is uniquely identified by its FIPS code.

The cleaning procedure is attached in Appendix 2: Data cleaning. You may go through it if you are interested or would like to make any changes.

First read in the data.
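
A minimal sketch of the reading step is given below. It writes a tiny stand-in file so that it runs without the real data; the columns other than FIPS and TotalPopEst2019 and all values are made up for illustration. Reading FIPS as character is a precaution, assuming the codes may carry leading zeros.

```r
# Toy stand-in for covid_county.csv (hypothetical columns and values),
# written to a temporary file so the sketch runs without the real data.
toy_path <- file.path(tempdir(), "covid_county_toy.csv")
writeLines(c("FIPS,State,TotalPopEst2019",
             "01001,AL,50000",
             "36061,NY,1600000"), toy_path)

# Read FIPS as character so that leading zeros (e.g. "01001") survive;
# the remaining columns keep their default types.
county_toy <- read.csv(toy_path,
                       colClasses = c(FIPS = "character"),
                       stringsAsFactors = FALSE)
str(county_toy)
```

The same call pattern would apply to covid_rates.csv and covid_intervention.csv, with the path replaced by the location of the real file.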

3.1 Understand the data

A detailed description of the variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the two datasets briefly.

3.2 COVID case trend

It is crucial to choose the right granularity for visualization and analysis. We will compare daily and weekly total new cases by state, and we will see that daily reports are hard to interpret.

  1. Plot new COVID cases in NY, WA and FL by state and by day. Are there any irregular patterns? What is the biggest problem with using single-day data?

  2. Create weekly new cases per 100k, weekly_case_per100k. Plot the spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as the population.

  3. Summarize the COVID case trend among states based on the spaghetti plot from the previous part. What are possible reasons for the variability?

  4. (Optional) Use covid_intervention to assess whether lockdowns were effective in flattening the curve.
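
The aggregation from daily cumulative counts to weekly new cases per 100k can be sketched as follows. This is an illustration on toy data, not the reference solution: the column names State, date and cum_cases are assumptions (only TotalPopEst2019 and FIPS are named in this write-up), and weeks are taken to start on Monday.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Toy county-level cumulative counts: two NY counties and one WA county,
# 14 days starting on a Monday. Column names and values are hypothetical.
days <- seq(as.Date("2020-03-02"), by = "day", length.out = 14)
rates_toy <- data.frame(
  FIPS  = rep(c("36001", "36003", "53001"), each = 14),
  State = rep(c("NY", "NY", "WA"), each = 14),
  date  = rep(days, times = 3),
  cum_cases = c(10 * 1:14, 5 * 1:14, 2 * 1:14),
  TotalPopEst2019 = rep(c(3e5, 2e5, 1e5), each = 14)
)

weekly <- rates_toy %>%
  # 1. Sum counties to get state-level cumulative cases and population.
  group_by(State, date) %>%
  summarise(cum_cases = sum(cum_cases),
            pop = sum(TotalPopEst2019), .groups = "drop") %>%
  # 2. Difference the cumulative series to get daily new cases.
  #    The first day of each state is NA because it has no predecessor.
  group_by(State) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(new_cases = cum_cases - lag(cum_cases),
         week = floor_date(date, "week", week_start = 1)) %>%
  # 3. Sum within each week and scale to cases per 100k residents.
  group_by(State, week) %>%
  summarise(weekly_case_per100k = 1e5 * sum(new_cases, na.rm = TRUE) / first(pop),
            .groups = "drop")

# Spaghetti plot: one line per state.
p <- ggplot(weekly, aes(week, weekly_case_per100k, group = State, colour = State)) +
  geom_line()
```

Differencing a cumulative series can yield negative values when counts are revised downward; the sketch leaves such values untouched so they remain visible in the plot.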

  1. For each month in 2020, plot the monthly deaths per 100k heatmap by state on US map. Use the same color range across months. (Hints: Set limits argument in scale_fill_gradient() or use facet_wrap(); use lubridate::month() and lubridate::year() to extract month and year from date; use tidyr::complete(state, month, fill = list(new_case_per100k = NA)) to complete the missing months with no cases.)

  2. (Optional) Use plotly to animate the monthly maps from the previous part. Does it reveal any systematic way to capture the dynamic changes among states? (Hints: Follow Appendix 3: Plotly heatmap in the Module 6 regularization lecture to plot the heatmap using plotly. Use the frame argument in add_trace() for animation. plotly only recognizes state abbreviations. Use unique(us_map(regions = "states") %>% select(abbr, full)) to get the abbreviations and merge them with the data.)
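
The data preparation behind both monthly maps can be sketched as below, on toy data. The column names new_deaths and pop and all values are assumptions. For the abbreviations, the sketch swaps in base R's built-in state.name and state.abb vectors instead of the us_map() lookup from the hint; note that these vectors do not include the District of Columbia.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy state-level daily new deaths (hypothetical names and values).
deaths_toy <- data.frame(
  State = c("New York", "New York", "New York", "Washington"),
  date  = as.Date(c("2020-03-15", "2020-03-20", "2020-04-10", "2020-04-05")),
  new_deaths = c(5, 15, 40, 3),
  pop = c(2e6, 2e6, 2e6, 1e6)
)

monthly <- deaths_toy %>%
  filter(year(date) == 2020) %>%
  mutate(month = month(date)) %>%
  group_by(State, month) %>%
  summarise(death_per100k = 1e5 * sum(new_deaths) / first(pop),
            .groups = "drop") %>%
  # Add explicit NA rows for state-month pairs with no records,
  # so every state appears in every monthly map.
  complete(State, month, fill = list(death_per100k = NA_real_))

# One colour range shared by all months.
fill_limits <- c(0, max(monthly$death_per100k, na.rm = TRUE))

# Attach two-letter abbreviations (base R lookup; DC is not covered).
abbr_lookup <- data.frame(State = state.name, abbr = state.abb)
monthly <- left_join(monthly, abbr_lookup, by = "State")

# Animated choropleth, built only if plotly is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_geo(monthly, locationmode = "USA-states")
  fig <- plotly::add_trace(fig, type = "choropleth",
                           locations = ~abbr, z = ~death_per100k,
                           zmin = fill_limits[1], zmax = fill_limits[2],
                           frame = ~month)
  fig <- plotly::layout(fig, geo = list(scope = "usa"))
}
```

For the static maps, the same fill_limits vector can be passed as the limits argument of scale_fill_gradient() so that colours are comparable across months.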

4 COVID factor

We now try to build a good parsimonious model to find possible factors related to the death rate at the county level. For the moment, we ignore the time series structure and use the cumulative totals as of Feb 1, 2021.

  1. Create the response variable total_death_per100k as the total number of COVID deaths per 100k by Feb 1, 2021. We suggest taking the log transformation log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k into county_data for the following analysis.

  2. Select possible variables in county_data as covariates. We provide county_data_sub, a subset of variables from county_data, to get you started. Please add any other potential variables you wish.

    1. Report missing values in your final subset of variables.

    2. In the following analysis, you may ignore the missing values.
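
The construction of the response and the missing-value report can be sketched in base R as follows. The toy tables and the column name cum_deaths are assumptions; FIPS, TotalPopEst2019 and PovertyAllAgesPct appear in this write-up, but the values here are invented.

```r
# Toy cumulative deaths by county and date (hypothetical values).
rates_toy <- data.frame(
  FIPS = c("01001", "01001", "01003", "01003"),
  date = as.Date(c("2021-01-31", "2021-02-01", "2021-02-01", "2021-02-02")),
  cum_deaths = c(48, 50, 0, 1)
)
# Toy county covariates; county 01005 has no death records at all.
county_toy <- data.frame(
  FIPS = c("01001", "01003", "01005"),
  TotalPopEst2019 = c(50000, 200000, 25000),
  PovertyAllAgesPct = c(12.1, NA, 27.0)
)

# Keep, for each county, the latest cumulative count on or before the cutoff.
cutoff <- as.Date("2021-02-01")
totals <- subset(rates_toy, date <= cutoff)
totals <- totals[order(totals$FIPS, totals$date), ]
totals <- totals[!duplicated(totals$FIPS, fromLast = TRUE), c("FIPS", "cum_deaths")]

# Merge by FIPS, keeping every county, then build the response.
county_data <- merge(county_toy, totals, by = "FIPS", all.x = TRUE)
county_data$total_death_per100k <-
  1e5 * county_data$cum_deaths / county_data$TotalPopEst2019
county_data$log_total_death_per100k <- log(county_data$total_death_per100k + 1)

# Number of missing values per column.
na_counts <- colSums(is.na(county_data))
na_counts
```

The + 1 inside the logarithm keeps counties with zero deaths in the data, since log(0) is not finite.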

  3. Use LASSO to choose a parsimonious model with all available sensible county-level information. Force in State in the process. Why do we need to force in State? You may use lambda.1se to choose a smaller model.

## Anova Table (Type II tests)
## 
## Response: log_death_rate
##                         Sum Sq   Df F value  Pr(>F)    
## State                      500   48   16.67 < 2e-16 ***
## PovertyAllAgesPct            0    1    0.25 0.61737    
## PerCapitaInc                 1    1    0.90 0.34204    
## PctEmpConstruction          28    1   45.44 1.9e-11 ***
## PctEmpMining                 7    1   11.05 0.00090 ***
## PctEmpAgriculture           79    1  126.72 < 2e-16 ***
## PctEmpManufacturing          1    1    0.93 0.33503    
## PopDensity2010               5    1    7.44 0.00641 ** 
## Age65AndOlderPct2010        12    1   19.44 1.1e-05 ***
## Under18Pct2010              37    1   59.14 2.0e-14 ***
## Ed3SomeCollegePct            9    1   14.52 0.00014 ***
## Ed5CollegePlusPct            6    1    9.11 0.00256 ** 
## NetMigrationRate1019        13    1   20.53 6.1e-06 ***
## NaturalChangeRate1019       12    1   19.65 9.6e-06 ***
## WhiteNonHispanicPct2010      2    1    2.99 0.08371 .  
## HispanicPct2010             10    1   16.39 5.3e-05 ***
## Type_2015_Update             3    1    5.08 0.02425 *  
## Residuals                 1899 3040                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
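
One way to force State into a LASSO fit is to give its dummy columns a penalty factor of zero in glmnet, so that they are never shrunk out of the model. The sketch below uses simulated data; the variables x1, x2, x3 and y are placeholders, not columns of county_data.

```r
library(glmnet)

# Simulated stand-in for the county data (hypothetical variables).
set.seed(1)
n <- 300
toy <- data.frame(
  State = factor(sample(c("AL", "NY", "WA"), n, replace = TRUE)),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$y <- 1 + 0.8 * (toy$State == "NY") + 0.5 * toy$x1 + rnorm(n)

# Design matrix without the intercept column; State expands into dummies.
X <- model.matrix(y ~ State + x1 + x2 + x3, data = toy)[, -1]
is_state <- grepl("^State", colnames(X))

# penalty.factor = 0 leaves the State dummies unpenalized (forced in);
# all other columns receive the usual L1 penalty.
cv_fit <- cv.glmnet(X, toy$y, alpha = 1,
                    penalty.factor = ifelse(is_state, 0, 1))

# Nonzero coefficients at lambda.1se, excluding the intercept.
beta <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- setdiff(names(beta)[beta != 0], "(Intercept)")
selected
```

The names in selected refer to columns of the design matrix, so a factor is represented by its individual dummy columns; the corresponding original variables can then be refitted with lm().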
  4. Use Cp or BIC to fine-tune the LASSO model from the previous step. Again, force in State in the process.

## Anova Table (Type II tests)
## 
## Response: log_death_rate
##                         Sum Sq   Df F value  Pr(>F)    
## State                      500   48   16.67 < 2e-16 ***
## PovertyAllAgesPct            0    1    0.25 0.61737    
## PerCapitaInc                 1    1    0.90 0.34204    
## PctEmpConstruction          28    1   45.44 1.9e-11 ***
## PctEmpMining                 7    1   11.05 0.00090 ***
## PctEmpAgriculture           79    1  126.72 < 2e-16 ***
## PctEmpManufacturing          1    1    0.93 0.33503    
## PopDensity2010               5    1    7.44 0.00641 ** 
## Age65AndOlderPct2010        12    1   19.44 1.1e-05 ***
## Under18Pct2010              37    1   59.14 2.0e-14 ***
## Ed3SomeCollegePct            9    1   14.52 0.00014 ***
## Ed5CollegePlusPct            6    1    9.11 0.00256 ** 
## NetMigrationRate1019        13    1   20.53 6.1e-06 ***
## NaturalChangeRate1019       12    1   19.65 9.6e-06 ***
## WhiteNonHispanicPct2010      2    1    2.99 0.08371 .  
## HispanicPct2010             10    1   16.39 5.3e-05 ***
## Type_2015_Update             3    1    5.08 0.02425 *  
## Residuals                 1899 3040                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
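
A simple way to fine-tune by BIC while keeping State is backward stepwise selection with step(), using State as the lower bound of the search scope. This is a stepwise search rather than an exhaustive one, and it is shown on simulated placeholder variables (x1, x2, x3, y), so it is only a sketch of the idea.

```r
# Simulated stand-in for the LASSO-selected variables (hypothetical).
set.seed(2)
n <- 300
toy <- data.frame(
  State = factor(sample(c("AL", "NY", "WA"), n, replace = TRUE)),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$y <- 1 + 0.8 * (toy$State == "NY") + 0.5 * toy$x1 + rnorm(n)

full <- lm(y ~ State + x1 + x2 + x3, data = toy)

# k = log(n) makes step() use the BIC penalty (k = 2 would give AIC).
# The lower scope ~ State prevents State from ever being dropped.
bic_fit <- step(full,
                scope = list(lower = ~ State,
                             upper = ~ State + x1 + x2 + x3),
                direction = "backward", k = log(n), trace = 0)
formula(bic_fit)
```

If the search leaves the model unchanged, as the identical ANOVA tables before and after this step suggest can happen, the fine-tuned model is simply the LASSO selection refitted by least squares.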
  5. If necessary, reduce the fine-tuned model from the previous step to a final model in which all variables are significant at the 0.05 level. Are the linear model assumptions all reasonably met?
## Anova Table (Type II tests)
## 
## Response: log_death_rate
##                         Sum Sq   Df F value  Pr(>F)    
## State                      508   48   16.93 < 2e-16 ***
## PctEmpConstruction          36    1   57.85 3.8e-14 ***
## PctEmpMining                11    1   17.74 2.6e-05 ***
## PctEmpAgriculture           91    1  145.97 < 2e-16 ***
## PopDensity2010               4    1    6.58   0.010 *  
## Age65AndOlderPct2010        11    1   18.25 2.0e-05 ***
## Under18Pct2010              38    1   60.85 8.4e-15 ***
## Ed3SomeCollegePct           12    1   19.71 9.3e-06 ***
## Ed5CollegePlusPct           31    1   49.64 2.3e-12 ***
## NetMigrationRate1019        15    1   23.38 1.4e-06 ***
## NaturalChangeRate1019       12    1   19.44 1.1e-05 ***
## WhiteNonHispanicPct2010      4    1    5.73   0.017 *  
## HispanicPct2010             10    1   16.61 4.7e-05 ***
## Type_2015_Update             3    1    4.45   0.035 *  
## Residuals                 1900 3043                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: not plotting observations with leverage one:
##   54
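
The reduction to a final model can be sketched as backward elimination on marginal F-tests, always keeping State. The sketch uses base R's drop1() rather than the Anova() call that produced the tables above; for a model with no interaction terms the two give the same F-tests. The data and variable names are simulated placeholders.

```r
# Simulated stand-in for the fine-tuned model's variables (hypothetical).
set.seed(3)
n <- 300
toy <- data.frame(
  State = factor(sample(c("AL", "NY", "WA"), n, replace = TRUE)),
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n)
)
toy$y <- 1 + 0.8 * (toy$State == "NY") + 0.5 * toy$x1 + rnorm(n)

fit <- lm(y ~ State + x1 + x2 + x3, data = toy)

repeat {
  # Marginal F-test for dropping each term from the current model.
  tests <- drop1(fit, test = "F")
  pvals <- tests[["Pr(>F)"]]
  names(pvals) <- rownames(tests)
  # State is forced in, so it is never a candidate for removal.
  pvals <- pvals[!(names(pvals) %in% c("<none>", "State"))]
  if (length(pvals) == 0 || max(pvals) < 0.05) break
  worst <- names(which.max(pvals))
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
formula(fit)

# Residuals-vs-fitted and normal Q-Q plots for checking assumptions.
if (interactive()) plot(fit, which = 1:2)
```

Removing one term at a time matters because the p-values of the remaining terms change after each removal.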

  6. It has been shown that COVID affects the elderly the most. It is also claimed that the COVID death rate among African Americans and Latinxs is higher. Does your analysis support these arguments?

  7. Based on your final model, summarize your findings. In particular, summarize the state effect controlling for others. Provide intervention recommendations to policy makers to reduce COVID death rate.

##       state      coef
##  1: StateND  0.896199
##  2: StateSD  0.789720
##  3: StateMT  0.533683
##  4: StateDC  0.424230
##  5: StateLA  0.352791
##  6: StateMS  0.272436
##  7: StateIA  0.265483
##  8: StateAZ  0.257350
##  9: StateNJ  0.251688
## 10: StateWY  0.181962
## 11: StateIL  0.181870
## 12: StateTX  0.153056
## 13: StateCT  0.111730
## 14: StateAR  0.056081
## 15: StateGA  0.053575
## 16: StateTN  0.052008
## 17: StateMA  0.027635
## 18: StatePA  0.000915
## 19: StateFL -0.006695
## 20: StateSC -0.020896
## 21: StateDE -0.075836
## 22: StateIN -0.079801
## 23: StateMD -0.133003
## 24: StateMI -0.152742
## 25: StateWI -0.153404
## 26: StateMN -0.185672
## 27: StateRI -0.226424
## 28: StateID -0.314283
## 29: StateCO -0.368938
## 30: StateNC -0.376737
## 31: StateOK -0.379630
## 32: StateMO -0.384755
## 33: StateNE -0.407081
## 34: StateNY -0.409466
## 35: StateVA -0.511923
## 36: StateOH -0.543226
## 37: StateNV -0.613914
## 38: StateKS -0.632561
## 39: StateNM -0.671560
## 40: StateWV -0.706377
## 41: StateKY -0.756633
## 42: StateCA -0.820722
## 43: StateNH -0.849123
## 44: StateWA -0.870822
## 45: StateOR -0.873242
## 46: StateUT -1.075081
## 47: StateME -1.394051
## 48: StateVT -2.158665
##       state      coef
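
A table like the one above can be produced by pulling the State coefficients out of the fitted model and sorting them. The sketch below does this on simulated data with placeholder variables. With R's default treatment contrasts, each coefficient is a difference from the reference level, which is why one state does not appear in such a table.

```r
# Simulated stand-in for the final model (hypothetical variables).
set.seed(4)
n <- 400
toy <- data.frame(
  State = factor(sample(c("AL", "ND", "NY", "VT"), n, replace = TRUE)),
  x1 = rnorm(n)
)
toy$y <- 1 + 0.9 * (toy$State == "ND") - 2 * (toy$State == "VT") +
  0.5 * toy$x1 + rnorm(n)

fit <- lm(y ~ State + x1, data = toy)

# Keep only the State dummies and sort from largest to smallest.
b <- coef(fit)
state_coef <- sort(b[grepl("^State", names(b))], decreasing = TRUE)

state_effects <- data.frame(
  state = sub("^State", "", names(state_coef)),
  coef  = unname(state_coef),
  # With a log(rate + 1) response, exp(coef) is the multiplicative
  # difference in (rate + 1) relative to the reference state.
  ratio_vs_reference = exp(unname(state_coef))
)
state_effects
```

Because the coefficients are relative to the reference state, their signs depend on which state serves as the reference; the ordering of states does not.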

  8. What else can we do to improve our model? What other important information might we have missed?